Encrypted Linear Regression

In this tutorial you will see how to run a linear regression model on data distributed across a pool of workers, with the computations encrypted using Secure Multi-Party Computation (SMPC). For this demonstration we will use the classic Housing Prices dataset, which is already available in the VirtualGrid set up by the Syft sandbox.
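To make the SMPC idea more concrete, here is a minimal sketch of additive secret sharing, a basic building block behind many SMPC protocols. It is plain Python rather than PySyft code. The ring size `Q` and the helper names `share`, `reconstruct` and `add_shared` are assumptions chosen for illustration only.

```python
import secrets
from typing import List

Q = 2 ** 64  # size of the integer ring (illustrative choice)


def share(secret: int, n_parties: int) -> List[int]:
    """Split an integer into n additive shares that sum to it modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares


def reconstruct(shares: List[int]) -> int:
    """Recover the secret by adding all shares modulo Q."""
    return sum(shares) % Q


def add_shared(x_shares: List[int], y_shares: List[int]) -> List[int]:
    """Each party adds its own two shares locally; no communication needed."""
    return [(a + b) % Q for a, b in zip(x_shares, y_shares)]


x_shares = share(25, n_parties=3)
y_shares = share(17, n_parties=3)
total = reconstruct(add_shared(x_shares, y_shares))
print(total)
```

Each share on its own is a uniformly random number, so a single party learns nothing about the secret, yet the parties can still compute on the shared values together.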

The implementation of the Encrypted Linear Regression algorithm in PySyft is based on Section 2 of this paper by Jonathan Bloom of the Broad Institute of MIT and Harvard.

Author: André Macedo Farias. GitHub: @andrelmfarias | Twitter: @andrelmfarias

1. Preliminaries

First, let's import PySyft and PyTorch and set up the Syft sandbox, which creates all the objects and tools we need to run our simulation (Virtual Workers, a VirtualGrid with datasets, etc.).


In [ ]:
import warnings
warnings.filterwarnings("ignore")

In [ ]:
import torch
import syft as sy
sy.create_sandbox(globals(), verbose=False)

You can see that we have several workers already set up:


In [ ]:
workers

And each one has a chunk of the Housing Prices dataset:


In [ ]:
for worker in workers:
    print(worker.search(["#housing", "#data"]))

2. Encrypted Linear Regression with PySyft

2.1 Loading Housing Prices data from Virtual Grid

Now that our Syft environment is set up, let's load the data.

Please note that, to avoid overflow in the SMPC computations performed by the linear model and to keep it numerically stable, we need to scale the data in such a way that the magnitude of each coordinate's average lies in the interval [0.1, 10].
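The overflow risk can be illustrated with a toy fixed-point encoding in a finite integer ring. This is only a sketch: the ring size `Q`, the precision `SCALE` and the helper names are illustrative assumptions, not necessarily the settings PySyft uses.

```python
# Illustrative parameters only; not necessarily PySyft's defaults.
Q = 2 ** 64        # size of the integer ring the encoded values live in
SCALE = 10 ** 3    # three decimal digits of fixed-point precision


def encode(x: float) -> int:
    """Map a real number to a ring element (fixed-point, wrapped modulo Q)."""
    return round(x * SCALE) % Q


def to_signed(v: int) -> int:
    """Interpret a ring element as a signed integer in [-Q/2, Q/2)."""
    return v - Q if v >= Q // 2 else v


def fixed_mul(a: int, b: int) -> float:
    """Multiply two encoded values and decode the double-scaled product."""
    return to_signed((a * b) % Q) / (SCALE * SCALE)


small = fixed_mul(encode(2.5), encode(4.0))   # stays well inside the ring
large = fixed_mul(encode(1e7), encode(1e7))   # true product is 1e14

print(small)           # decodes correctly
print(large == 1e14)   # False: the product wrapped around the ring
```

Products of large values exceed the ring and silently wrap around, which is why keeping every coordinate at a moderate magnitude matters.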

This scaling can usually be done without revealing the data or the averages: you only need a rough idea of the order of magnitude. For example, if one of the coordinates is the surface area of the house in m², you should divide it by 100, since house surface areas are on the order of 100 m² on average.

After running the model and obtaining the main statistics, we can rescale them back if needed. The same can be done with predictions.
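As a sketch of how the rescaling works, consider a one-feature regression. If the feature is divided by `x_scale` and the target by `y_scale` before fitting, the slope in original units is the fitted slope times `y_scale / x_scale`, and the intercept is the fitted intercept times `y_scale`. The data and the helper `fit_simple` below are hypothetical and only serve the illustration.

```python
import math


def fit_simple(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical data: surface in square metres and price.
surfaces = [50.0, 80.0, 120.0, 150.0]
prices = [150e3, 230e3, 350e3, 420e3]

x_scale, y_scale = 100.0, 1e5
xs_scaled = [x / x_scale for x in surfaces]
ys_scaled = [y / y_scale for y in prices]

# Fit on scaled data, then map the statistics back to original units.
slope_s, intercept_s = fit_simple(xs_scaled, ys_scaled)
slope = slope_s * y_scale / x_scale
intercept = intercept_s * y_scale

# Same model fitted directly on the unscaled data, for comparison.
slope_direct, intercept_direct = fit_simple(surfaces, prices)
print(math.isclose(slope, slope_direct), math.isclose(intercept, intercept_direct))
```

Predictions made with the scaled model are mapped back in the same way, by multiplying them by `y_scale`.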

In this tutorial I will load the data and scale it following this idea:


In [ ]:
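# One scaling factor per feature column, plus one for the target
scale_data = torch.Tensor([10., 10., 10., 1., 1., 10., 100., 10., 10., 1000., 10., 1000., 10.])
scale_target = 100.0

housing_data = []
housing_targets = []
for worker in workers:
    # Pointers to this worker's chunk of the features and targets
    data_ptr = sy.local_worker.request_search(["#housing", "#data"], location=worker)[0]
    target_ptr = sy.local_worker.request_search(["#housing", "#target"], location=worker)[0]
    # The scaling vector is sent to the worker, so the division happens remotely
    housing_data.append(data_ptr / scale_data.send(worker))
    housing_targets.append(target_ptr / scale_target)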

2.2 Setting up 2 more Virtual workers: the crypto provider and the "honest but curious" worker

To run the linear regression, we need two more workers: a crypto provider and an honest but curious worker. Both are necessary to ensure the security of the SMPC computations when we run the model in a pool with more than 3 workers.

Note: the honest but curious worker is a legitimate participant in a communication protocol who will not deviate from the defined protocol but will attempt to learn all possible information from legitimately received messages.


In [ ]:
crypto_prov = sy.VirtualWorker(hook, id="crypto_prov")
hbc_worker = sy.VirtualWorker(hook, id="hbc_worker")
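To give a feel for what a crypto provider can contribute, here is a minimal two-party sketch of multiplication with a Beaver triple, a technique common in SPDZ-style protocols. It is plain Python under stated assumptions: the ring size `Q` and all function names are hypothetical, and this is not a reproduction of PySyft's internals.

```python
import secrets

Q = 2 ** 64  # ring size (illustrative choice)


def share2(value):
    """Split a value into two additive shares modulo Q."""
    r = secrets.randbelow(Q)
    return r, (value - r) % Q


def open_shares(s0, s1):
    """Reveal a shared value by adding both shares."""
    return (s0 + s1) % Q


def crypto_provider_triple():
    """The provider samples random a, b and hands out shares of a, b and a*b."""
    a, b = secrets.randbelow(Q), secrets.randbelow(Q)
    return share2(a), share2(b), share2((a * b) % Q)


def beaver_mul(x_sh, y_sh):
    """Multiply two shared values without revealing x or y."""
    a_sh, b_sh, c_sh = crypto_provider_triple()
    # The parties open only the masked values eps = x - a and delta = y - b.
    eps = open_shares((x_sh[0] - a_sh[0]) % Q, (x_sh[1] - a_sh[1]) % Q)
    delta = open_shares((y_sh[0] - b_sh[0]) % Q, (y_sh[1] - b_sh[1]) % Q)
    # x*y = c + eps*b + delta*a + eps*delta; the public term is added once.
    z0 = (c_sh[0] + eps * b_sh[0] + delta * a_sh[0] + eps * delta) % Q
    z1 = (c_sh[1] + eps * b_sh[1] + delta * a_sh[1]) % Q
    return z0, z1


product = open_shares(*beaver_mul(share2(6), share2(7)))
print(product)
```

Because `a` and `b` are uniformly random, the opened values `eps` and `delta` reveal nothing about the inputs, provided the triple comes from a party that does not collude with the others.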

2.3 Running Encrypted Linear Regression with SMPC

Now let's import EncryptedLinearRegression from the linalg module of PySyft:


In [ ]:
from syft.frameworks.torch.linalg import EncryptedLinearRegression

Let's train the model!!


In [ ]:
crypto_lr = EncryptedLinearRegression(crypto_provider=crypto_prov, hbc_worker=hbc_worker)
crypto_lr.fit(housing_data, housing_targets)
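For intuition about how a regression can be fitted on data split across workers, here is a plain, unencrypted sketch. It assumes the usual normal-equations formulation, in which each worker only contributes small aggregates (the entries of its local X^T X and X^T y) that are summed before solving. This is an assumption about the general approach, not PySyft's actual code, and the function names and data are hypothetical.

```python
def local_aggregates(xs, ys):
    """Entries of X^T X and X^T y for one chunk, with design matrix [1, x]."""
    return {
        "n": len(xs),
        "sx": sum(xs),
        "sxx": sum(x * x for x in xs),
        "sy": sum(ys),
        "sxy": sum(x * y for x, y in zip(xs, ys)),
    }


def solve_from_aggregates(aggregates):
    """Sum the per-worker aggregates and solve the 2x2 normal equations."""
    total = {key: sum(agg[key] for agg in aggregates) for key in aggregates[0]}
    n, sx, sxx = total["n"], total["sx"], total["sxx"]
    sy, sxy = total["sy"], total["sxy"]
    det = n * sxx - sx * sx
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope


# Two hypothetical workers, each holding a chunk of data generated by y = 1 + 2x.
chunk_a = local_aggregates([1, 2], [3, 5])
chunk_b = local_aggregates([3, 4, 5], [7, 9, 11])

intercept, slope = solve_from_aggregates([chunk_a, chunk_b])
print(intercept, slope)
```

In an encrypted setting the summation and the solve would be carried out on secret-shared aggregates, so no party sees another worker's raw rows.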

We can display the results with the method .summarize():


In [ ]:
crypto_lr.summarize()

We can see that the EncryptedLinearRegression gives not only the coefficients and the intercept, but also their standard errors and p-values!
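As a reminder of what these statistics mean, the following sketch computes standard errors and t-statistics with the textbook OLS formulas for a one-feature model. It works on plain, unencrypted, hypothetical data and is not the encrypted computation itself. The p-values would follow from a t-distribution with n - 2 degrees of freedom, which is left out here because it is not part of the Python standard library.

```python
import math


def ols_with_standard_errors(xs, ys):
    """Fit y = intercept + slope * x and return estimates with standard errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Unbiased estimate of the residual variance (two fitted parameters).
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mx * mx / sxx))
    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_slope,
        "se_intercept": se_intercept,
        "t_slope": slope / se_slope,
        "t_intercept": intercept / se_intercept,
    }


# Hypothetical data, not taken from the Housing Prices dataset.
result = ols_with_standard_errors([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
for name, value in result.items():
    print("{:<13s}{:>10.4f}".format(name, value))
```

A large t-statistic relative to its standard error, as for the slope here, corresponds to a small p-value.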

3. Comparing results with other linear regressors

Now, in order to show the effectiveness of the EncryptedLinearRegression, let's compare it with the Linear Regression from other known libraries.

3.1 Sending data to the local worker for comparison purposes

First, let's send the data to the local worker and convert the torch.Tensors into numpy.arrays:


In [ ]:
import numpy as np

data_tensors = [x.copy().get() for x in housing_data] 
target_tensors = [y.copy().get() for y in housing_targets]

data_np = torch.cat(data_tensors, dim=0).numpy()
target_np = torch.cat(target_tensors, dim=0).numpy()

3.2 Scikit-learn

First, let's compare the results with sklearn's LinearRegression:


In [ ]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(data_np, target_np.squeeze())

Display the results:


In [ ]:
print("=" * 25)
print("Sklearn Linear Regression")
print("=" * 25)
for i, coef in enumerate(lr.coef_, 1):
    print(" coeff{:<3d}".format(i), "{:>14.4f}".format(coef))
print(" intercept:", "{:>12.4f}".format(lr.intercept_))
print("=" * 25)

You can see that the results are pretty much the same! There are some small differences, but they are never higher than 0.2% of the value computed by the sklearn model!

For an encrypted model that can compute linear regression coefficients without ever revealing the data, this is a huge achievement!
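A comparison like this can also be checked programmatically. The helper below is a small sketch with hypothetical coefficient values (they are not results from this notebook); `max_relative_diff` is an illustrative name, not a library function.

```python
def max_relative_diff(reference, candidate):
    """Largest |candidate - reference| / |reference| over all coefficients."""
    return max(abs(c - r) / abs(r) for r, c in zip(reference, candidate))


# Hypothetical numbers, NOT results from this notebook.
reference_coefs = [0.5000, -1.2500, 3.0000]
encrypted_coefs = [0.5005, -1.2490, 3.0030]

worst = max_relative_diff(reference_coefs, encrypted_coefs)
print("max relative difference: {:.3%}".format(worst))
print("within 0.2%:", worst <= 0.002)
```

With real results, `reference_coefs` would hold the sklearn coefficients and `encrypted_coefs` the ones reported by the encrypted model.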

3.3 Statsmodels API

We can do the same using the linear regression from the Statsmodels API, which also gives us the standard errors and p-values of the coefficients. We can then compare them with the results given by the EncryptedLinearRegression.


In [ ]:
import statsmodels.api as sm
mod = sm.OLS(target_np.squeeze(), sm.add_constant(data_np), hasconst=True)
res = mod.fit()
print(res.summary())

Once again, we can see that all results are pretty much the same!!

Well Done!

And voilà! We were able to train an OLS regression model on distributed data without ever seeing it. We were even able to compute standard errors and p-values for each coefficient.

Also, after comparing our results with results given by other known libraries, we were able to validate this approach.

Congratulations!!! - Time to Join the Community!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy-preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!

Star PySyft on GitHub

The easiest way to help our community is just by starring the repositories! This helps raise awareness of the cool tools we're building.

Pick our tutorials on GitHub!

We made really nice tutorials to get a better understanding of what Federated and Privacy-Preserving Learning should look like and how we are building the bricks for this to happen.

Join our Slack!

The best way to keep up to date on the latest advancements is to join our community!

Join a Code Project!

The best way to contribute to our community is to become a code contributor! If you want to start "one off" mini-projects, you can go to the PySyft GitHub Issues page and search for issues marked Good First Issue.

If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!